“I’ve missed more than 9000 shots in my career. I’ve lost almost 300 games. 26 times, I’ve been trusted to take the game winning shot and missed. I’ve failed over and over and over again in my life. And that is why I succeed.” - Michael Jordan
Michael Jordan hits the game winner over Craig Ehlo during Game 5 of a 1989 playoff series.
In sports analytics, one of the first steps in understanding and analyzing data is manipulation—transforming raw data into a form suitable for analysis. The ability to manipulate data efficiently is vital for uncovering insights in sports, such as analyzing player statistics, team performance, and game outcomes.
In this chapter, we introduce dplyr, a powerful R package designed for data manipulation. dplyr allows you to perform a range of common data wrangling tasks in a clear, concise, and readable way. Using dplyr functions, you can filter observations, select variables, compute summaries, and transform datasets. These operations will form the foundation for more advanced analyses in later chapters.
The dplyr package provides several key functions for manipulating data: - mutate() to add new variables or transform existing ones. - select() to choose specific columns of data. - filter() to subset the data based on conditions. - summarise() to aggregate data. - arrange() to reorder rows of data.
In this chapter, we will explore each of these functions in detail using sports data from the Lahman package, a commonly used dataset in sports analytics. The Lahman package includes historical Major League Baseball (MLB) data, such as player statistics, team performance, and other key metrics.
1.1 Using mutate() to Add or Transform Variables
The mutate() function in dplyr is an essential tool for creating new variables or modifying existing ones in a dataset. In the context of sports analytics, this function is particularly useful when you want to compute new metrics or adjust raw data into more meaningful forms. For example, a player’s raw performance statistics, such as hits or points scored, can be transformed into ratios or averages, providing more insightful performance measures such as batting averages or efficiency ratings.
Here, data is the name of the dataset you are working with, new_variable is the name of the variable you are adding or modifying, and expression is the formula used to calculate the value of the new variable.
We can also use the pipe operator as follows:
data|>mutate(new_variable =expression)
1.1.2 Why Use mutate()?
In sports analytics, it is often necessary to create variables that represent metrics or statistics derived from existing data. For example:
Creating a new variable that represents a player’s scoring efficiency.
Calculating a team’s win rate by dividing the number of wins by the total number of games played.
Computing advanced metrics such as player efficiency rating (PER) in basketball or on-base plus slugging (OPS) in baseball.
By transforming raw data into meaningful statistics, you can uncover new insights that are not immediately apparent in the original dataset.
Example 1.1 (Calculating Batting Average) Let’s begin with a basic example using data from the Lahman package, which contains Major League Baseball (MLB) statistics. We will calculate each player’s batting average (BA), a widely used performance metric that represents the ratio of a player’s hits (H) to at-bats (AB).
In this example, we use mutate() to create a new variable BA, which stores the batting average for each player. This new column is derived from dividing the number of hits (H) by the number of at-bats (AB).
1.1.3 Handling Missing Values with mutate()
In some cases, you may encounter missing or undefined values in your dataset. When using mutate() to create new variables, it’s important to handle these missing values to avoid errors or incorrect calculations. In R, you can use the na.rm = TRUE argument in relevant functions (like sum() or mean()) to remove missing values before performing calculations.
Example 1.2 (Handling division by zero) Let’s modify the batting average calculation to handle cases where the number of at-bats (AB) is zero, which would otherwise lead to a division by zero error.
In this case, we use ifelse() to check if a player has zero at-bats (AB == 0). If this condition is met, we assign a missing value (NA) to the batting average, otherwise, the batting average is calculated as usual.
1.1.4 Multiple Transformations with mutate()
The mutate() function can be used to perform multiple transformations in a single call. This is useful when you need to create several new variables based on existing data. You can also overwrite existing variables if needed.
Example 1.3 (Performing multiple transformations with mutate())
In this example, we calculate several new statistics—batting average (BA), on-base percentage (OBP), slugging percentage (SLG), and On-base Plus Slugging (OPS)—all within the same mutate() call.
The mutate() function is a versatile and powerful tool for creating new variables and transforming existing ones. In sports analytics, this ability is invaluable when calculating advanced statistics that provide deeper insights into player and team performance. Whether you’re calculating simple metrics like batting averages or more complex statistics like on-base plus slugging, mutate() allows you to efficiently manipulate and analyze your data. In the next sections, we will explore additional dplyr functions that complement mutate(), such as select(), filter(), summarise(), and arrange().
1.2 Using select() to Choose Variables
The select() function in dplyr is a powerful tool for selecting specific columns (or variables) from a dataset. In sports analytics, datasets can contain a large number of variables, many of which may not be immediately relevant to your analysis. By using select(), you can streamline your workflow, focusing on only the columns that are essential for the task at hand.
The function allows you to extract specific columns based on their names, making the data more manageable and readable, especially when you’re working with large datasets like those from the NBA, MLB, or other sports leagues.
Here, data is the name of the dataset, and column1, column2, etc., represent the columns you wish to select from the dataset.
1.2.2 Why Use select()?
In sports analytics, it is common to work with large datasets that contain numerous variables, such as player names, performance metrics, team information, game statistics, and more. Often, you only need a few of these columns for a specific analysis. By using select(), you can focus on the relevant variables, simplifying your data and improving performance when working with large datasets.
For example:
Selecting player performance statistics such as points, assists, and rebounds.
Narrowing down a dataset to include only specific game information, such as game date, team, and score.
Choosing variables that represent team-level statistics over a season, like win totals or offensive efficiency.
Example 1.4 (Selecting Key Batting Statistics) Let’s use the Lahman dataset again, which contains a wealth of baseball statistics. Suppose we are only interested in a few key batting statistics, such as player ID (playerID), year (yearID), at-bats (AB), hits (H), and home runs (HR). We can use select() to extract these columns:
In this example, the select() function filters the dataset down to just the columns playerID, yearID, AB, H, and HR. This is especially useful when you’re conducting specific analyses on player performance metrics like at-bats (AB), hits (H), and home runs (HR).
1.2.3 Selecting Columns by Range
You can also select columns by specifying a range using the colon (:) operator. This is helpful when dealing with consecutive columns that you wish to include in your analysis.
Example 1.5 (Selecting a range of columns) Let’s say you want to select all the columns between AB and HR in the batting dataset.
In this example, the select(AB:HR) syntax selects all columns from AB to HR inclusively. This feature is handy when you want to work with a block of consecutive columns.
1.2.4 Dropping Columns
In some cases, you might want to exclude certain columns from your dataset rather than select specific ones. You can do this using the - sign in combination with select() to drop unwanted columns.
Example 1.6 (Dropping a specific column) Let’s say you want to keep all columns except for the player’s ID (playerID).
Here, select(-playerID) removes the playerID column from the dataset, returning all other columns. This is useful when you want to retain most of the dataset but exclude a few irrelevant variables.
1.2.5 Using Helper Functions with select()
dplyr also provides helper functions that allow you to select columns based on certain patterns or criteria. These include:
starts_with(): Select columns whose names start with a certain prefix.
ends_with(): Select columns whose names end with a certain suffix.
contains(): Select columns whose names contain a certain string.
matches(): Select columns that match a regular expression.
These functions are particularly useful when working with datasets where columns follow a naming convention, such as stats that begin with “player_” or end with “_rate.”
Example 1.7 (Selecting Columns that Start with a Prefix) Suppose you are working with a basketball dataset and want to select all statistics related to shooting that start with field_goals. In this example, we will use the hoopR package in R. This package provide functions for fetching NBA and men’s college basketball data.
In this example, select(starts_with("field_goals")) extracts all columns whose names begin with “field_goals,” such as field_goals_made, and field_goals_attempted.
The select() function is a highly flexible tool for extracting relevant columns from large datasets, making it easier to work with and analyze the data that matters most. Whether you’re working with baseball, basketball, or any other sport, select() can streamline your workflow, helping you focus on the specific variables that are critical to your analysis. By combining select() with helper functions, such as starts_with(), ends_with(), and others, you can perform more advanced selections based on patterns or column names, further enhancing your ability to manipulate sports data.
In the next section, we’ll explore how to subset your data using filter() to focus on specific cases or observations based on conditions, such as selecting players above a certain performance threshold.
1.3 Using filter() to Subset the Data
The filter() function in dplyr is used to subset rows in a dataset based on specified conditions. This function allows you to extract observations that meet certain criteria, which is crucial in sports analytics when you need to focus on specific players, teams, or events. Whether you’re interested in filtering by player performance, team rankings, or game outcomes, filter() enables you to work with just the relevant data for your analysis.
Here, data is the dataset, and condition1, condition2, etc., are the conditions that the rows must meet to be included in the subset. Conditions typically involve logical comparisons, such as >, <, ==, !=, and %in%.
1.3.2 Why Use filter()?
In sports analytics, datasets often contain a wealth of information about many players, games, or seasons. However, your analysis might focus on a specific subset of this data, such as:
Filtering players with more than a certain number of points or minutes played.
Focusing on teams with a winning percentage above a specific threshold.
Subsetting games where a team scored over 100 points.
Isolating data from a particular season, team, or player.
By using filter(), you can isolate the rows that are most relevant to your analysis, making it easier to conduct meaningful evaluations or comparisons.
Example 1.8 (Filtering Players with More than 200 At-Bats) Let’s return to the Lahman dataset and use filter() to subset players who had more than 200 at-bats in a given season. This kind of filtering is useful when analyzing only those players who had significant playing time.
In this example, we use filter(AB > 200) to keep only the rows where players had more than 200 at-bats (AB). This ensures that our analysis focuses on players with sufficient playing time, avoiding the inclusion of those who had limited opportunities.
1.3.3 Filtering on Multiple Conditions
You can use filter() to apply multiple conditions at once, allowing for more precise subsetting. Multiple conditions can be combined with logical operators such as & (AND) and | (OR).
Example 1.9 (Filtering Players with More than 200 At-Bats and More Than 20 Home Runs)
Here, we use filter(AB > 200 & HR > 20) to select players who had more than 200 at-bats and also hit more than 20 home runs in a season. This type of filtering is useful for identifying high-performing players who both played regularly and demonstrated power-hitting ability.
1.3.4 Filtering with Logical Operators
You can apply more complex logical conditions with filter(). For example, if you wanted to find players who either had more than 200 at-bats or hit more than 20 home runs (but not necessarily both), you could use the OR operator |.
In this example, players are included in the subset if they meet either of the conditions: more than 200 at-bats or more than 20 home runs.
1.3.5 Using %in% to Filter by a Set of Values
The %in% operator can be used within filter() to check whether a column’s values belong to a specific set. This is particularly useful when you want to filter by specific categories, such as player IDs, teams, or years.
Example 1.11 (Filtering Data for Specific Years) Let’s say you want to analyze data from only the 2010 and 2015 seasons.
In this example, we use filter(yearID %in% c(2010, 2015)) to retain only the rows where the yearID column matches 2010 or 2015. This approach is helpful when working with data from specific seasons or games.
1.3.6 Filtering Missing Data
In many datasets, missing values (represented as NA in R) can cause problems if not handled properly. You can use filter() to exclude rows with missing values or to isolate rows where certain values are missing.
Example 1.12 (Filtering Out Missing Data) Let’s filter out rows where the number of at-bats (AB) is missing.
# Removing rows with missing values for at-batsbatting_no_na=Batting|>filter(!is.na(AB))head(batting_no_na)
Here, we use filter(!is.na(AB)) to remove rows where the value for at-bats (AB) is missing (NA). This ensures that we are working only with complete data.
The filter() function is an indispensable tool for narrowing down large datasets based on specific criteria. Whether you’re filtering by player performance, season, or game outcomes, filter() allows you to extract just the rows that are most relevant to your analysis. In sports analytics, this enables you to focus on specific players, teams, or metrics that drive meaningful insights.
In the next section, we will explore how to use the summarise() function to aggregate data, such as calculating team-level statistics or summarizing player performance over a season or career.
1.4 Using summarise() to Aggregate Data
The summarise() function in dplyr is used to create summary statistics from your dataset by collapsing multiple rows into a single value. This function is essential in sports analytics, where you often need to aggregate data to calculate overall metrics, such as total points, average performance statistics, or team-level summaries. With summarise(), you can compute a wide variety of statistics such as sums, averages, minimums, and maximums, allowing for powerful aggregation of data.
new_summary is the name of the new summary statistic.
aggregation_function(column) is the function applied to the column you want to summarize (e.g., mean(), sum(), min(), max()).
1.4.2 Why Use summarise()?
In sports analytics, it is often necessary to calculate overall performance metrics, such as:
Summing a player’s total points over multiple seasons.
Calculating a team’s average points per game.
Aggregating player data to find league-wide averages or totals.
Computing summary statistics like the highest or lowest scoring games in a season.
Using summarise(), you can efficiently compute these summary statistics and gain high-level insights from raw data.
1.4.3 Grouping Data Before Summarizing
When using summarise(), it’s common to group data by one or more variables before applying the summary function. This is achieved with the group_by() function, which allows you to summarize data across different categories, such as by player, team, or season.
For example:
Summarizing total points per player across multiple games.
Calculating the average performance metrics per team in a season.
Aggregating statistics by season to track trends over time.
Example 1.13 (Summarizing Total Home Runs per Year) Let’s use the Lahman dataset to calculate the total number of home runs hit by all players for each year. First, we will group the data by year and then use summarise() to compute the total home runs.
In this example, we first use group_by(yearID) to group the data by year, and then summarise() is used to compute the total number of home runs (HR) hit in each year. The na.rm = TRUE argument ensures that missing values are ignored during the summation. This type of summary can help analyze trends in home run performance over time.
Example 1.14 (Calculating Average Batting Average per Year) You can also calculate averages using summarise(). Let’s calculate the average batting average (BA) for each year by grouping the data by yearID and then using the mean() function within summarise().
# Calculating average batting average per yearbatting_average_per_year<-Batting|>group_by(yearID)|>summarise(avg_BA =mean(H/AB, na.rm =TRUE))head(batting_average_per_year)
In this example, we compute the average batting average (avg_BA) for each year. The mean() function calculates the average of hits divided by at-bats, and na.rm = TRUE ensures that missing data is handled appropriately.
Multiple Summaries with summarise()
You can compute multiple summary statistics in a single summarise() call by simply listing additional summary expressions separated by commas.
Example 1.15 (Summarizing Both Total and Average Home Runs per Year)
# Summarizing total and average home runs per yearhome_run_stats<-Batting|>group_by(yearID)|>summarise(total_HR =sum(HR, na.rm =TRUE), avg_HR =mean(HR, na.rm =TRUE))head(home_run_stats)
In this example, we calculate both the total home runs (total_HR) and the average home runs per player (avg_HR) for each year. This provides both a high-level view (total) and a per-player perspective (average) of home run performance.
1.4.4 Adding Summary Statistics to the Original Data
In some cases, you may want to add summary statistics back to your original dataset. This can be achieved by combining mutate() with summarise() to append calculated statistics to each row.
Example 1.16 (Adding Average Points per Player to Each Team)
In this example, we first calculate the average points per team (avg_pts_team) and then use right_join() to merge this summary statistic back into the original dataset, so that each player’s row now contains their team’s average points.
The summarise() function is a critical tool for aggregating and summarizing data in sports analytics. Whether you’re calculating total points, average performance metrics, or team-level statistics, summarise() allows you to collapse large datasets into meaningful summary statistics. By combining summarise() with group_by(), you can gain insights into trends across seasons, teams, or players, providing a comprehensive view of the data.
In the next section, we’ll explore the arrange() function, which is used to order and rank your data, making it easier to identify top performers and trends.
1.5 Using arrange() to Order Rows
The arrange() function in dplyr is used to reorder the rows of a dataset based on the values of one or more variables. This function is particularly useful in sports analytics when you want to rank players or teams according to performance metrics, such as points scored, assists made, or wins achieved. By using arrange(), you can quickly identify top performers, sort data chronologically, or rank teams based on standings.
column1, column2, etc., are the columns by which the data should be ordered.
By default, arrange() sorts the rows in ascending order. To sort in descending order, you can wrap the column name in desc().
1.5.2 Why Use arrange()?
In sports analytics, ranking and sorting data is a common task. Whether you’re determining the highest scorers, ordering teams by win percentage, or ranking players by efficiency, arrange() allows you to easily reorder your data to facilitate comparisons and insights.
For example, you may want to:
Rank basketball players by points scored per game.
Sort baseball teams by total wins.
Rank athletes by efficiency ratings, from highest to lowest.
Example 1.17 (Arranging by Batting Average) Let’s use the Lahman dataset again and sort the players by their batting average (BA). First, we need to calculate the batting average using mutate() and then sort the dataset in descending order to identify the top performers.
In this example, we first calculate the batting average (BA) using mutate() and then use arrange(desc(BA)) to sort the data in descending order of batting average. This allows us to see the players with the highest batting averages at the top of the dataset.
1.5.3 Arranging by Multiple Variables
You can also sort data by multiple columns. For example, if two players have the same batting average, you might want to break the tie by looking at their home run totals.
Example 1.18 (Arranging by Batting Average and Home Runs)
Here, we first sort by batting average (BA) in descending order, and in the case of ties, we further arrange the rows by home runs (HR) in descending order. This type of multi-variable sorting is useful for resolving ties or ranking data based on multiple criteria.
1.5.4 Sorting Chronologically
In many sports datasets, you might want to sort data chronologically to observe trends over time. This can be done by arranging data by a date or year variable.
Example 1.19 (Sorting by Year) Let’s sort the ‘Lahman’ dataset by the yearID variable to view the data in chronological order.
This example sorts the data in ascending order by yearID, allowing us to see the dataset arranged chronologically by season. Sorting by year or date is especially useful when analyzing time trends in performance or observing how a player’s career evolves over time.
The arrange() function is a valuable tool in sports analytics for ordering and ranking data. Whether you’re sorting players by performance metrics, ranking teams based on wins, or organizing data chronologically, arrange() allows you to structure your data in a way that facilitates meaningful analysis. You can arrange data by one or more columns, and by using desc(), you can sort in descending order to prioritize top performers.
1.6 Concept Quiz
What is the primary function of the mutate() function in dplyr?
Which function in dplyr is used to select specific columns from a dataset?
In the context of the mutate() function, what would happen if you attempt to divide a value by zero without handling it?
Which of the following functions is used to reorder rows in a dataset?
To extract rows of data that meet specific conditions, such as “players with more than 200 at-bats”, which function would you use?
True or False: The select() function allows you to drop specific columns by prefixing their names with a minus sign (-).
True or False: In the filter() function, using & combines conditions in such a way that both conditions must be true for a row to be included in the subset.
True or False: The arrange() function in dplyr can only sort data in ascending order.